gene_set_extras_usage

We’re going to load in some basic packages. We use clusterProfiler to do the initial gene set analysis and the test data we’re using is human so we’re loading the org.Hs.eg.db annotation package - you’d modify that to use the package for the species you’re using for your data.

We’re using tidyverse for general data loading and manipulation - this isn’t required but it’s nice.

library(geneSetExtras)
library(org.Hs.eg.db)
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> 
library(clusterProfiler)
#> 
#> clusterProfiler v4.18.4 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
#> 
#> Please cite:
#> 
#> G Yu. Thirteen years of clusterProfiler. The Innovation. 2024,
#> 5(6):100722
#> 
#> Attaching package: 'clusterProfiler'
#> The following object is masked from 'package:AnnotationDbi':
#> 
#>     select
#> The following object is masked from 'package:IRanges':
#> 
#>     slice
#> The following object is masked from 'package:S4Vectors':
#> 
#>     rename
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   4.0.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.1.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ lubridate::%within%() masks IRanges::%within%()
#> ✖ dplyr::collapse()     masks IRanges::collapse()
#> ✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
#> ✖ dplyr::desc()         masks IRanges::desc()
#> ✖ tidyr::expand()       masks S4Vectors::expand()
#> ✖ dplyr::filter()       masks clusterProfiler::filter(), stats::filter()
#> ✖ dplyr::first()        masks S4Vectors::first()
#> ✖ dplyr::lag()          masks stats::lag()
#> ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
#> ✖ purrr::reduce()       masks IRanges::reduce()
#> ✖ dplyr::rename()       masks clusterProfiler::rename(), S4Vectors::rename()
#> ✖ lubridate::second()   masks S4Vectors::second()
#> ✖ lubridate::second<-() masks S4Vectors::second<-()
#> ✖ dplyr::select()       masks clusterProfiler::select(), AnnotationDbi::select()
#> ✖ purrr::simplify()     masks clusterProfiler::simplify()
#> ✖ dplyr::slice()        masks clusterProfiler::slice(), IRanges::slice()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading test data

The format of the data here isn’t important, we just need some way to define the background and hit lists for each group.

read_delim(system.file("hit_clusters.txt", package="geneSetExtras")) -> hit_data

hit_data |>
  group_by(cluster) |>
  count()
#> # A tibble: 8 × 2
#> # Groups:   cluster [8]
#>   cluster        n
#>   <chr>      <int>
#> 1 cluster_11   383
#> 2 cluster_13   626
#> 3 cluster_15   271
#> 4 cluster_19   349
#> 5 cluster_4    822
#> 6 cluster_5    144
#> 7 cluster_8    677
#> 8 <NA>       20975

Running Cluster Profiler

We need to generate an initial set of results from clusterProfiler. We’re doing overlap (rather than quantitative) statistics here. We should use the same settings for each of the tests. It’s important that we don’t filter the hits to keep only significant results but get all of the data for all tested categories.

It’s a good idea to limit the size of the gene sets used for the tests. ClusterProfiler does this by default anyway.

In this instance we’ll just use the GO:BP subsection. We’re just going to use all significant genes in this test, but you can be more selective. Depending on the nature of your data it may make sense to split out upregulated and downregulated genes.

We’ll calculate results just for cluster 4 to illustrate how this works.

enrichGO(
  gene = hit_data |> filter(cluster=="cluster_4") |> pull(Gene),
  universe = hit_data |> pull(Gene),
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  pvalueCutoff = Inf,
  qvalueCutoff = Inf,
  readable = TRUE,
  ont = "BP"
) -> enrich_cluster_4

Plotting

Now we have individual hits we can plot them out using a volcano plot. It’s often useful to make these interactive so that you can put them into Notebooks and mouse over individual hits to see what they are.

volcano_plot(enrich_cluster_4, interactive=TRUE)

Clustering

Next we want to do some categorical clustering to find hits which are similar in their gene content and then pick a suitable term to represent each of them.

cluster_enrich_result(enrich_cluster_4) -> clusterered_results_cluster4
#> Joining with `by = join_by(Description)`
clusterered_results_cluster4
#> # A tibble: 24 × 14
#>    ID    Description term_cluster term_cluster_size GeneRatio BgRatio RichFactor
#>    <chr> <chr>              <int>             <int> <chr>     <chr>        <dbl>
#>  1 GO:0… cell-cell …            1                 2 45/628    233/14…      0.193
#>  2 GO:0… modulation…            2                19 59/628    404/14…      0.146
#>  3 GO:0… regulation…            3                66 55/628    363/14…      0.152
#>  4 GO:0… synapse or…            4                 6 52/628    425/14…      0.122
#>  5 GO:0… adenylate …            5                 6 26/628    167/14…      0.156
#>  6 GO:0… sensory pe…            6                17 44/628    414/14…      0.106
#>  7 GO:0… neurotrans…            7                10 24/628    168/14…      0.143
#>  8 GO:0… response t…            8                 7 37/628    348/14…      0.106
#>  9 GO:0… response t…            9                 4 18/628    109/14…      0.165
#> 10 GO:0… potassium …           10                 6 22/628    170/14…      0.129
#> # ℹ 14 more rows
#> # ℹ 7 more variables: FoldEnrichment <dbl>, zScore <dbl>, pvalue <dbl>,
#> #   p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
cluster_volcano_plot(clusterered_results_cluster4, interactive = TRUE)

Automation

We have several clusters, so it would be nice to show the results for all of them. This isn’t specific to the package, but shows how this could be done.

calculate_cluster_enrichment <- function(data,cluster_name) {
  enrichGO(
    gene = data |> filter(cluster==cluster_name) |> pull(Gene),
    universe = data |> pull(Gene),
    OrgDb = org.Hs.eg.db,
    keyType = "SYMBOL",
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    readable = TRUE,
    ont = "BP"
  )
}

hit_data |> arrange(cluster)|> filter(!is.na(cluster)) |> distinct(cluster) |> pull(cluster) -> cluster_names

lapply(
  cluster_names,
  function(x)calculate_cluster_enrichment(hit_data,x)
) -> all_enrichment_results

names(all_enrichment_results) <- cluster_names

We can then plot out the results for each cluster

for (name in cluster_names) {
  print(volcano_plot(all_enrichment_results[[name]], interactive=TRUE, title=name))
}

We can also generate clustered results

lapply(
  all_enrichment_results,
  cluster_enrich_result
) -> all_clustered_hits
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`

for (i in 1:length(cluster_names)) {
  if (! is.null(all_clustered_hits[[i]])) {
    all_clustered_hits[[i]] |>
      add_column(cluster=cluster_names[i], .before=1) -> all_clustered_hits[[i]]
    
    print(all_clustered_hits[[i]])
  }
}
#> # A tibble: 7 × 15
#>   cluster    ID     Description term_cluster term_cluster_size GeneRatio BgRatio
#>   <chr>      <chr>  <chr>              <int>             <int> <chr>     <chr>  
#> 1 cluster_11 GO:00… extracellu…            1                 4 16/214    284/14…
#> 2 cluster_11 GO:00… blood circ…            2                15 20/214    424/14…
#> 3 cluster_11 GO:00… response t…            3                34 21/214    464/14…
#> 4 cluster_11 GO:00… C21-steroi…            4                14 4/214     16/140…
#> 5 cluster_11 GO:00… positive r…            5                 2 4/214     21/140…
#> 6 cluster_11 GO:00… L-glutamat…            6                 4 4/214     25/140…
#> 7 cluster_11 GO:19… cellular r…            7                 2 7/214     93/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 9 × 15
#>   cluster    ID     Description term_cluster term_cluster_size GeneRatio BgRatio
#>   <chr>      <chr>  <chr>              <int>             <int> <chr>     <chr>  
#> 1 cluster_13 GO:00… extracellu…            1                10 37/431    284/14…
#> 2 cluster_13 GO:00… complement…            2                 5 13/431    37/140…
#> 3 cluster_13 GO:00… regulation…            3                11 18/431    128/14…
#> 4 cluster_13 GO:00… skeletal s…            4                 7 35/431    458/14…
#> 5 cluster_13 GO:00… positive r…            5                 2 35/431    478/14…
#> 6 cluster_13 GO:00… renal syst…            6                 9 25/431    287/14…
#> 7 cluster_13 GO:00… vascular e…            7                 5 8/431     43/140…
#> 8 cluster_13 GO:00… oxygen tra…            8                 3 4/431     10/140…
#> 9 cluster_13 GO:00… positive r…            9                 2 8/431     55/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 11 × 15
#>    cluster    ID    Description term_cluster term_cluster_size GeneRatio BgRatio
#>    <chr>      <chr> <chr>              <int>             <int> <chr>     <chr>  
#>  1 cluster_15 GO:0… nuclear di…            1                95 76/243    405/14…
#>  2 cluster_15 GO:0… DNA replic…            2                24 39/243    270/14…
#>  3 cluster_15 GO:0… meiotic ce…            3                17 34/243    244/14…
#>  4 cluster_15 GO:0… centromere…            4                 6 11/243    28/140…
#>  5 cluster_15 GO:1… positive r…            5                14 11/243    28/140…
#>  6 cluster_15 GO:0… cell cycle…            6                21 20/243    146/14…
#>  7 cluster_15 GO:0… sister chr…            7                 3 12/243    54/140…
#>  8 cluster_15 GO:0… positive r…            8                 6 9/243     26/140…
#>  9 cluster_15 GO:0… centrosome…            9                 8 15/243    134/14…
#> 10 cluster_15 GO:0… DNA biosyn…           10                 2 14/243    181/14…
#> 11 cluster_15 GO:0… cytokineti…           11                 2 7/243     42/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 3 × 15
#>   cluster    ID     Description term_cluster term_cluster_size GeneRatio BgRatio
#>   <chr>      <chr>  <chr>              <int>             <int> <chr>     <chr>  
#> 1 cluster_19 GO:00… roof of mo…            1                21 11/279    83/140…
#> 2 cluster_19 GO:00… negative r…            2                 6 21/279    395/14…
#> 3 cluster_19 GO:00… negative r…            3                 4 6/279     45/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 24 × 15
#>    cluster   ID     Description term_cluster term_cluster_size GeneRatio BgRatio
#>    <chr>     <chr>  <chr>              <int>             <int> <chr>     <chr>  
#>  1 cluster_4 GO:00… cell-cell …            1                 2 45/628    233/14…
#>  2 cluster_4 GO:00… modulation…            2                19 59/628    404/14…
#>  3 cluster_4 GO:00… regulation…            3                66 55/628    363/14…
#>  4 cluster_4 GO:00… synapse or…            4                 6 52/628    425/14…
#>  5 cluster_4 GO:00… adenylate …            5                 6 26/628    167/14…
#>  6 cluster_4 GO:00… sensory pe…            6                17 44/628    414/14…
#>  7 cluster_4 GO:00… neurotrans…            7                10 24/628    168/14…
#>  8 cluster_4 GO:00… response t…            8                 7 37/628    348/14…
#>  9 cluster_4 GO:00… response t…            9                 4 18/628    109/14…
#> 10 cluster_4 GO:00… potassium …           10                 6 22/628    170/14…
#> # ℹ 14 more rows
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 12 × 15
#>    cluster   ID     Description term_cluster term_cluster_size GeneRatio BgRatio
#>    <chr>     <chr>  <chr>              <int>             <int> <chr>     <chr>  
#>  1 cluster_8 GO:00… sterol bio…            1                22 20/462    58/140…
#>  2 cluster_8 GO:00… endoderm f…            2                 6 10/462    52/140…
#>  3 cluster_8 GO:00… regulation…            3                12 9/462     46/140…
#>  4 cluster_8 GO:00… myofibril …            4                 4 10/462    62/140…
#>  5 cluster_8 GO:00… axon guida…            5                18 19/462    199/14…
#>  6 cluster_8 GO:00… angiogenes…            6                14 33/462    470/14…
#>  7 cluster_8 GO:00… intermedia…            7                 3 9/462     52/140…
#>  8 cluster_8 GO:00… organophos…            8                 4 19/462    205/14…
#>  9 cluster_8 GO:00… organic hy…            9                 9 19/462    231/14…
#> 10 cluster_8 GO:00… negative r…           11                10 7/462     42/140…
#> 11 cluster_8 GO:00… negative r…           12                 2 10/462    88/140…
#> 12 cluster_8 GO:00… ovulation             13                 2 4/462     14/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> #   pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>

Comparison

Next we want to look for terms whose enrichment changes significantly between different datasets. We can do this by passing a list of the enrichment results we want to compare. Only terms which are in all sets can be tested, which is why it’s important that the original enrichment is calculated without a pvalue filter.

*NB There is an issue with clusterProfiler where it doesn’t report categories with no overlaps in the gene set, however you set the filters. This may be fixed in the latest devel version but we’re waiting to see. At the moment these results may under-represent the true hits.

compare_enrichment_results(all_enrichment_results) -> enrichment_differences
#> Joining with `by = join_by(ID)`

enrichment_differences |>
  filter(p.adjust<0.05)
#> # A tibble: 281 × 11
#>    ID         Description               p.adjust   pvalue FoldEnrichment_clust…¹
#>    <chr>      <chr>                        <dbl>    <dbl>                  <dbl>
#>  1 GO:0007059 chromosome segregation    2.53e-49 4.55e-53                  0    
#>  2 GO:0000280 nuclear division          6.32e-41 2.27e-44                  0.649
#>  3 GO:0098813 nuclear chromosome segre… 1.53e-40 8.26e-44                  0    
#>  4 GO:0048285 organelle fission         4.70e-39 3.38e-42                  0.581
#>  5 GO:0140014 mitotic nuclear division  3.32e-34 2.98e-37                  0.738
#>  6 GO:0000819 sister chromatid segrega… 9.01e-34 9.71e-37                  0    
#>  7 GO:0000070 mitotic sister chromatid… 4.88e-33 6.14e-36                  0    
#>  8 GO:0033044 regulation of chromosome… 2.32e-22 3.33e-25                  0.272
#>  9 GO:0051983 regulation of chromosome… 2.20e-20 3.55e-23                  0    
#> 10 GO:0044772 mitotic cell cycle phase… 2.73e-20 4.89e-23                  0.485
#> # ℹ 271 more rows
#> # ℹ abbreviated name: ¹​FoldEnrichment_cluster_11
#> # ℹ 6 more variables: FoldEnrichment_cluster_13 <dbl>,
#> #   FoldEnrichment_cluster_15 <dbl>, FoldEnrichment_cluster_19 <dbl>,
#> #   FoldEnrichment_cluster_4 <dbl>, FoldEnrichment_cluster_5 <dbl>,
#> #   FoldEnrichment_cluster_8 <dbl>

We can see that these are redundant because we’re using all terms to test with.

Let’s pick out only the terms which were in our clustered hits and test just those.

lapply(all_clustered_hits, function(x)x$ID) -> interesting_ids

do.call(c,interesting_ids) |> unique() -> interesting_ids

lapply(all_enrichment_results, function(x)x |> filter(ID %in% interesting_ids)) -> filtered_enrichment_results

compare_enrichment_results(filtered_enrichment_results) -> filtered_enrichment_differences
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Joining with `by = join_by(ID)`

filtered_enrichment_differences |>
  filter(p.adjust<0.05)
#> # A tibble: 48 × 11
#>    ID         Description               p.adjust   pvalue FoldEnrichment_clust…¹
#>    <chr>      <chr>                        <dbl>    <dbl>                  <dbl>
#>  1 GO:0000280 nuclear division          1.43e-42 2.27e-44                  0.649
#>  2 GO:0006260 DNA replication           3.17e-21 1.01e-22                  0.486
#>  3 GO:0051321 meiotic cell cycle        3.42e-16 1.63e-17                  0.538
#>  4 GO:0050804 modulation of chemical s… 7.71e- 8 5.45e- 9                  1.14 
#>  5 GO:0007062 sister chromatid cohesion 7.71e- 8 6.12e- 9                  1.22 
#>  6 GO:0044839 cell cycle G2/M phase tr… 1.01e- 7 9.66e- 9                  0.450
#>  7 GO:0007098 centrosome cycle          3.34e- 7 3.71e- 8                  0    
#>  8 GO:1905820 positive regulation of c… 7.87e- 7 1.00e- 7                  0    
#>  9 GO:0034508 centromere complex assem… 1.11e- 6 1.58e- 7                  0    
#> 10 GO:0098742 cell-cell adhesion via p… 1.92e- 6 3.05e- 7                  1.41 
#> # ℹ 38 more rows
#> # ℹ abbreviated name: ¹​FoldEnrichment_cluster_11
#> # ℹ 6 more variables: FoldEnrichment_cluster_13 <dbl>,
#> #   FoldEnrichment_cluster_15 <dbl>, FoldEnrichment_cluster_19 <dbl>,
#> #   FoldEnrichment_cluster_4 <dbl>, FoldEnrichment_cluster_5 <dbl>,
#> #   FoldEnrichment_cluster_8 <dbl>

Now we’re down to just 48 terms which is a lot more manageable.

Plot comparison results

We can take the comparison results and generate a heatmap which shows the enrichment levels in the different gene sets

plot_differential_enrichment_heatmap(
  filtered_enrichment_differences |> filter(p.adjust<0.05),
  scale=TRUE,
  log_scale = FALSE
)